Smart Paper Notes

Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning

Julien Perolat, Bart de Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T. Connor, Neil Burch, Thomas Anthony

Category: Artificial Intelligence

2022-06-30

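We introduce DeepNash, an autonomous agent capable of learning to play the imperfect information game Stratego from scratch, up to a human expert level. Stratego is one of the few iconic board games that artificial intelligence (AI) has not yet mastered. This popular game has an enormous game tree on the order of $10^{535}$ nodes, i.e., $10^{175}$ times larger than that of Go. It has the additional complexity of requiring decision-making under imperfect information, similar to Texas hold'em poker, which has a significantly smaller game tree (on the order of $10^{164}$ nodes). Decisions in Stratego are made over a large number of discrete actions with no obvious link between action and outcome. Episodes are long, often with hundreds of moves before a player wins, and situations in Stratego cannot easily be broken down into manageably sized sub-problems as in poker. For these reasons, Stratego has been a grand challenge for the field of AI for decades, and existing AI methods barely reach an amateur level of play. DeepNash uses a game-theoretic, model-free deep reinforcement learning method, without search, that learns to master Stratego via self-play. The Regularised Nash Dynamics (R-NaD) algorithm, a key component of DeepNash, converges to an approximate Nash equilibrium, instead of "cycling" around it, by directly modifying the underlying multi-agent learning dynamics. DeepNash beats existing state-of-the-art AI methods in Stratego and achieved a yearly (2022) and all-time top-3 rank on the Gravon games platform, competing with human expert players.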

Translated by Google Translate
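The abstract above contrasts converging to a Nash equilibrium with "cycling" around it. The toy sketch below illustrates that contrast in matching pennies, a two-action zero-sum game whose only equilibrium is uniform play. It is not DeepNash or the full R-NaD algorithm: the logit update rule, the fixed uniform reference policy, and the names `run`, `eta`, and `lr` are all assumptions made for illustration. The only idea borrowed is that a penalty of the form $-\eta \log(\pi / \pi_{\mathrm{ref}})$ added to each player's reward changes the learning dynamics themselves.

```python
import math


def sigmoid(d):
    """Probability of playing heads given the logit difference d."""
    return 1.0 / (1.0 + math.exp(-d))


def kl_from_uniform(p):
    """KL(uniform || (p, 1 - p)) for a two-action policy; zero only at p = 0.5."""
    return -0.5 * math.log(4.0 * p * (1.0 - p))


def run(eta, steps=2000, lr=0.1, dx=1.0, dy=-0.5):
    """Simulate simultaneous logit updates in matching pennies.

    dx, dy: logit differences (heads minus tails) of players 1 and 2.
    eta:    strength of the penalty toward a uniform reference policy
            (hypothetical toy setting; eta = 0 means no regularisation).
    Returns the summed KL distance of both policies from uniform play.
    """
    for _ in range(steps):
        x, y = sigmoid(dx), sigmoid(dy)
        # Advantage of heads over tails: player 1 wants to match,
        # player 2 wants to mismatch.
        adv_x = 2.0 * (2.0 * y - 1.0)
        adv_y = -2.0 * (2.0 * x - 1.0)
        # With a uniform reference, the penalty -eta * log(pi / pi_ref)
        # contributes -eta * d to the heads-minus-tails advantage.
        dx, dy = dx + lr * (adv_x - eta * dx), dy + lr * (adv_y - eta * dy)
    return kl_from_uniform(sigmoid(dx)) + kl_from_uniform(sigmoid(dy))


if __name__ == "__main__":
    start = kl_from_uniform(sigmoid(1.0)) + kl_from_uniform(sigmoid(-0.5))
    print("initial distance:      ", start)
    print("unregularised (eta=0): ", run(eta=0.0))
    print("regularised (eta=0.2): ", run(eta=0.2))
```

In this toy setting the unregularised update keeps orbiting the equilibrium and drifts away from it, while the regularised update settles on uniform play. Here the fixed point of the regularised game happens to coincide with the Nash equilibrium because the reference policy is uniform; in general a fixed reference would bias the fixed point.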

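Games have a long history of serving as a benchmark for progress in artificial intelligence. Recently, approaches using search and learning have shown strong performance across a set of perfect information games, and approaches using game-theoretic reasoning and learning have shown strong performance for specific imperfect information poker variants. We introduce Player of Games, a general-purpose algorithm that unifies previous approaches, combining guided search, self-play learning, and game-theoretic reasoning. Player of Games is the first algorithm to achieve strong empirical performance in large perfect and imperfect information games, an important step towards truly general algorithms for arbitrary environments. We prove that Player of Games is sound, converging to perfect play as available computation time and approximation capacity increase. Player of Games reaches strong performance in chess and Go, beats the strongest openly available agent in heads-up no-limit Texas hold'em poker (Slumbot), and defeats the state-of-the-art agent in Scotland Yard, an imperfect information game that illustrates the value of guided search, learning, and game-theoretic reasoning.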


Researchers have demonstrated that neural networks are vulnerable to adversarial examples and subtle environment changes, both of which one can view as a form of distribution shift. To humans, the resulting errors can look like blunders, eroding trust in these agents. In prior games research, agent evaluation often focused on the in-practice game outcomes. While valuable, such evaluation typically fails to evaluate robustness to worst-case outcomes. Prior research in computer poker has examined how to assess such worst-case performance, both exactly and approximately. Unfortunately, exact computation is infeasible with larger domains, and existing approximations rely on poker-specific knowledge. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, thereby approximating worst-case performance. We demonstrate the technique in several two-player zero-sum games against a variety of agents, including several AlphaZero-based agents.
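To make the idea of worst-case performance concrete, the sketch below computes an exact best response to a fixed agent policy in rock-paper-scissors. This is the small-game exact computation that the abstract describes as infeasible in larger domains, not ISMCTS-BR itself; the payoff table and the function names `best_response_value` and `exploitability` are hypothetical and chosen only for this illustration.

```python
# Payoff to the agent (row player) in zero-sum rock-paper-scissors.
# Rows: agent plays rock, paper, scissors. Columns: opponent plays the same.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def best_response_value(agent_policy, payoff=PAYOFF):
    """Return (worst-case expected agent payoff, opponent's best-response action)."""
    n_cols = len(payoff[0])
    # Expected agent payoff against each pure opponent action.
    values = [
        sum(p * row[j] for p, row in zip(agent_policy, payoff))
        for j in range(n_cols)
    ]
    # The best-responding opponent picks the action that hurts the agent most.
    best_action = min(range(n_cols), key=lambda j: values[j])
    return values[best_action], best_action


def exploitability(agent_policy, game_value=0.0):
    """Gap between the game value and what the agent gets against a best response."""
    worst_case, _ = best_response_value(agent_policy)
    return game_value - worst_case


if __name__ == "__main__":
    print(exploitability([1 / 3, 1 / 3, 1 / 3]))   # uniform play
    print(best_response_value([0.5, 0.25, 0.25]))  # rock-heavy agent
```

A rock-heavy agent is punished by an opponent who always plays paper, whereas uniform play cannot be exploited. In large imperfect information games this enumeration is not possible, which is the gap a learned best response is meant to fill.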
